Last November I spent twenty minutes trying to find a conversation where I'd worked through an embedding model comparison. I knew I'd had it, I knew roughly when, but I couldn't remember which file it was in or what words I'd used. Filename search was useless. Keyword search was marginally better but required guessing exact phrasing: did I write "vector database" or "embedding store" or "semantic index"? The information existed somewhere in thousands of documents (AI conversations, transcribed audio logs, diary entries, Discord messages), and I had no way to query it by meaning.

Architecture

The core pipeline: embed every chunk of text into a 384-dimensional vector using all-MiniLM-L6-v2 from sentence-transformers (about 80 MB on disk), store embeddings plus metadata in SQLite, and use HNSW via hnswlib for approximate nearest-neighbor lookup at query time. HNSW (Hierarchical Navigable Small World graphs) replaces a brute-force scan over every stored vector, whose cost grows linearly with the size of the collection, with a graph traversal that answers in milliseconds. The accuracy trade-off is so small I've never noticed it in practice.
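For reference, here is a minimal sketch of the exact search that HNSW approximates. It is pure Python with random vectors standing in for real embeddings, so it is not the hnswlib code path; the function names are made up for illustration. It shows why the cost scales with the number of stored vectors: every query touches every one of them.

```python
import heapq
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k=5):
    """Score every stored vector against the query: O(n * d) work per query."""
    scored = ((cosine_similarity(query, v), i) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored)  # list of (similarity, index), best first


# Random stand-ins for real 384-dimensional embeddings (assumption).
rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(384)] for _ in range(1000)]

# A slightly perturbed copy of vector 42 plays the role of a paraphrased query.
query = [x + 0.01 * rng.gauss(0, 1) for x in vectors[42]]

top = exact_top_k(query, vectors, k=3)
print(top[0][1])  # index 42
```

An approximate index gives up the guarantee of always finding the true best match in exchange for visiting only a small part of the collection per query.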

pipeline:
  input: personal documents (AI chats, audio logs, diary, Discord)
  chunking: configurable (delimiter-based or date-pattern)
  embedding: all-MiniLM-L6-v2 (384 dimensions)
  index: HNSW via hnswlib
  metadata: SQLite
  interface: CLI + Python API
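The metadata side can be sketched with nothing but the standard library. The table layout, column names, and helper functions below are assumptions for illustration, not the actual schema: each chunk gets one row, and its embedding is stored as a BLOB of 32-bit floats.

```python
import sqlite3
from array import array

DIM = 384  # output size of all-MiniLM-L6-v2


def open_store(path=":memory:"):
    """Open (or create) the chunk store. The schema is a hypothetical example."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS chunks (
               id        INTEGER PRIMARY KEY,
               source    TEXT NOT NULL,  -- file the chunk came from
               doc_type  TEXT NOT NULL,  -- e.g. ai_chat, audio, diary, discord
               text      TEXT NOT NULL,
               embedding BLOB NOT NULL   -- DIM float32 values
           )"""
    )
    return conn


def add_chunk(conn, source, doc_type, text, vector):
    """Insert one chunk and return its integer row id."""
    if len(vector) != DIM:
        raise ValueError(f"expected {DIM} dimensions, got {len(vector)}")
    blob = array("f", vector).tobytes()
    cur = conn.execute(
        "INSERT INTO chunks (source, doc_type, text, embedding) VALUES (?, ?, ?, ?)",
        (source, doc_type, text, blob),
    )
    conn.commit()
    return cur.lastrowid


def load_vector(conn, chunk_id):
    """Read an embedding back as a list of floats."""
    row = conn.execute(
        "SELECT embedding FROM chunks WHERE id = ?", (chunk_id,)
    ).fetchone()
    if row is None:
        raise KeyError(chunk_id)
    vec = array("f")
    vec.frombytes(row[0])
    return list(vec)
```

One reason to use an integer primary key here: hnswlib labels its items with integers, so the same id can link a search hit back to its row.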

Each document type gets its own chunking logic. AI chats split on --- delimiters between exchanges. Audio transcriptions split on date-line patterns. Diary entries and Discord messages each have their own splitting rules. The model runs on CPU, loads in under a second, and embeds fast enough to support incremental updates, where only new content gets processed.
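One way to decide what counts as new content is to hash each chunk and skip any hash already seen. This is a sketch of that idea under my own assumptions, not necessarily how the real system tracks changes; `chunk_key` and `select_new_chunks` are hypothetical names.

```python
import hashlib


def chunk_key(text):
    """Stable fingerprint of a chunk's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_new_chunks(chunks, seen_keys):
    """Return chunks not seen before, recording their keys in seen_keys."""
    fresh = []
    for text in chunks:
        key = chunk_key(text)
        if key not in seen_keys:
            seen_keys.add(key)
            fresh.append(text)
    return fresh


seen = set()  # in a real run this would be loaded from the metadata store
first_run = select_new_chunks(["chunk a", "chunk b"], seen)
second_run = select_new_chunks(["chunk a", "chunk b", "chunk c"], seen)
print(second_run)  # ['chunk c']
```

Only the chunks returned by `select_new_chunks` would need to go through the embedding model, which keeps re-indexing cost proportional to what changed.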

Everything runs locally, no exceptions. No API keys, no subscriptions. The content is personal in the fullest sense: half-formed ideas, emotional processing, speculative reasoning, the kind of thing you write when you think nobody's watching. Sending it to a cloud API was never on the table.

Chunking

This is where the real effort went, and it's the part that matters most (more than embedding model choice, more than index parameters).

How do you split a 10,000-word AI conversation into pieces that each carry coherent meaning? Too large and the embedding becomes a vague average. Too small and a sentence fragment doesn't hold enough semantic content to match against anything useful. The sweet spot depends entirely on the document type: the --- delimiter works for AI chats because it aligns with natural exchange boundaries; date stamps work for audio transcriptions because they mark topic shifts; paragraph-level chunks work for diary entries because that's how I write them.
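The three strategies described above can be sketched as a small dispatch table. The exact patterns are assumptions: a delimiter is taken to be a line containing only dashes, and a date line is taken to start with an ISO date such as 2023-11-02. Discord is left out because its splitting rule isn't described here.

```python
import re

# Assumed format: a line consisting only of three or more dashes.
DELIMITER = re.compile(r"^-{3,}[ \t]*$", re.MULTILINE)
# Assumed format: a line that begins with an ISO date (YYYY-MM-DD).
DATE_LINE = re.compile(r"^\d{4}-\d{2}-\d{2}\b", re.MULTILINE)


def split_on_delimiter(text):
    """AI chats: one chunk per exchange between --- lines."""
    return [part.strip() for part in DELIMITER.split(text) if part.strip()]


def split_on_date_lines(text):
    """Audio transcriptions: each date line starts a new chunk."""
    starts = [m.start() for m in DATE_LINE.finditer(text)]
    if not starts:
        return [text.strip()] if text.strip() else []
    chunks = []
    preamble = text[: starts[0]].strip()
    if preamble:  # keep any text that precedes the first date line
        chunks.append(preamble)
    bounds = starts + [len(text)]
    for begin, end in zip(bounds, bounds[1:]):
        chunks.append(text[begin:end].strip())
    return chunks


def split_on_paragraphs(text):
    """Diary entries: one chunk per blank-line-separated paragraph."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


# Hypothetical type names; the real configuration keys may differ.
CHUNKERS = {
    "ai_chat": split_on_delimiter,
    "audio": split_on_date_lines,
    "diary": split_on_paragraphs,
}


def chunk_document(text, doc_type):
    """Pick the splitting rule that matches the document type."""
    return CHUNKERS[doc_type](text)
```

For example, `chunk_document("Q1\nA1\n---\nQ2\nA2", "ai_chat")` yields two chunks, one per exchange, with the delimiter line itself dropped.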

The failure mode is subtle: bad chunking doesn't produce errors, it produces poor search results. The system returns documents; they're just not the right ones. Debugging that means evaluating result quality by hand rather than reading a stack trace, which is slower and more uncertain than tracking down a proper crash.
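Hand evaluation gets less painful with a small fixed set of queries whose correct answer is already known. The sketch below is one possible harness, not something the system ships with: it assumes a `search(query, k)` callable that returns chunk ids, and the canned results are invented purely to make the example run.

```python
def hit_rate_at_k(search, labeled_queries, k=5):
    """Fraction of queries whose expected chunk id appears in the top k results."""
    if not labeled_queries:
        raise ValueError("need at least one labeled query")
    hits = 0
    for query, expected_id in labeled_queries:
        if expected_id in search(query, k):
            hits += 1
    return hits / len(labeled_queries)


def fake_search(query, k):
    """Toy stand-in for the real search function (made-up results)."""
    canned = {
        "embedding comparison": [7, 3, 9],
        "small datasets": [2, 5, 1],
    }
    return canned.get(query, [])[:k]


# Each pair is (query, id of the chunk I know should come back).
labeled = [("embedding comparison", 3), ("small datasets", 4)]
score = hit_rate_at_k(fake_search, labeled, k=3)
print(score)  # 0.5
```

Re-running the same labeled set after each chunking change turns "the results feel worse" into a number that can be compared across runs.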

A query like "that conversation about training neural networks on small datasets" returns the right document even though none of those exact words appear in it: the semantic compression in even a small model like MiniLM handles paraphrase matching well. The index covers years of records now and keeps growing. When I need to find something I half-remember, a single query usually gets me there.